Intro to Classifying Structured Data with TensorFlow

This notebook demonstrates classifying structured data. The code presented here can become a starting point for a problem you care about. Our goal is to introduce a variety of techniques (especially feature engineering) rather than to aim for high accuracy on the demo dataset we'll explore.

Notes

  • If you run this notebook multiple times, you'll want to restore it to a clean state. When you run the notebook, the Estimators will write logs and checkpoint files to disk. These will be in a ./graphs directory in the same folder as this notebook. Delete this to restore to a clean state.
  • We'll demonstrate two types of input functions. First, the pre-built Pandas input function, and second, one written using the new Datasets API. At the time of writing (v1.3) the Datasets API is in contrib. When it moves to core (most likely in v1.4) we'll update this notebook.
In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections

import numpy as np
import pandas as pd

from IPython.display import Image

import tensorflow as tf
print('This code requires TensorFlow v1.3+')
print('You have:', tf.__version__)
This code requires TensorFlow v1.3+
You have: 1.3.0

About the dataset

Here, we'll work with the Adult dataset, which was extracted from the 1994 US Census database. Our task is to predict whether an individual has an income over $50,000 / year, based on attributes such as their age and occupation. This is a generic problem with a variety of numeric and categorical attributes - which makes it useful for demonstration purposes.

A great way to get to know the dataset is by using Facets - an open source tool for visualizing and exploring data. At the time of writing, the online demo has the Census data preloaded. Try it! In the screenshot below, each dot represents a person, that is, a row from the CSV. They're colored by the label we want to predict ('blue' for less than 50k / year, 'red' for more). In the online demo, clicking on a person will show the attributes, or columns from the CSV file, that describe them - such as their age and occupation.

In [2]:
Image(filename='./images/facets1.jpg', width=500)
Out[2]:
In [3]:
census_train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
census_test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
census_train_path = tf.contrib.keras.utils.get_file('census.train', census_train_url)
census_test_path = tf.contrib.keras.utils.get_file('census.test', census_test_url)
Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
3923968/3974305 [============================>.] - ETA: 0s
Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
1875968/2003153 [===========================>..] - ETA: 0s

The dataset is missing a header, so we'll add one here. You can find descriptions of these columns in the names file.

In [4]:
column_names = [
  'age', 'workclass', 'fnlwgt', 'education', 'education-num',
  'marital-status', 'occupation', 'relationship', 'race', 'gender',
  'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
  'income'
]

Load the data using Pandas

In the first half of this notebook, we'll assume the dataset fits into memory. Should you need to work with larger files, you can use the Datasets API to read them.

In [5]:
# Notes
# 1) We provide the header from above.
# 2) The test file has a line we want to discard at the top, so we include the parameter 'skiprows=1'
census_train = pd.read_csv(census_train_path, index_col=False, names=column_names)
census_test = pd.read_csv(census_test_path, skiprows=1, index_col=False, names=column_names)

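# Drop any rows that have missing elements.
# Of course there are other ways to handle missing data, but we'll
# take the simplest approach here.
# Note: this file marks missing values with ' ?', which Pandas reads as an
# ordinary string, so dropna() leaves those rows in place.
census_train = census_train.dropna(how="any", axis=0)
census_test = census_test.dropna(how="any", axis=0)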

Correct formatting problems with the Census data

As it happens, there's a small formatting problem with the testing CSV file that we'll fix here. The labels in the testing file are written differently than they are in the training file. Notice the extra "." after "<=50K" and ">50K" in the screenshot below.

You can open the CSVs in your favorite text editor to see the error, or you can see it with Facets in "overview mode" - which makes it easy to catch this kind of mistake early.

In [6]:
Image(filename='./images/facets2.jpg', width=500)
Out[6]:
In [7]:
# Separate the label we want to predict into its own object 
# At the same time, we'll convert it into true/false to fix the formatting error
census_train_label = census_train.pop('income').apply(lambda x: ">50K" in x)
census_test_label = census_test.pop('income').apply(lambda x: ">50K" in x)
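
To see why the substring test above fixes the formatting problem, here is a minimal, TensorFlow-free sketch. The four label strings are written out by hand from the description above (the test file adds a trailing "."), and `to_bool` is a hypothetical helper name that mirrors the lambda used in the cell.

```python
# Labels as they appear in the two files: the test file appends a period.
train_labels = [" <=50K", " >50K"]
test_labels = [" <=50K.", " >50K."]


def to_bool(label):
    """Return True for the over-50K class, whatever the trailing punctuation."""
    return ">50K" in label


train_bools = [to_bool(x) for x in train_labels]
test_bools = [to_bool(x) for x in test_labels]

# An exact comparison would miss the positive class in the test file.
exact_match = [x == " >50K" for x in test_labels]

print(train_bools, test_bools, exact_match)
```

Both files map to the same true/false values with the substring test, while the exact comparison marks every test label as false.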

I find it useful to print out the shape of the data as I go, as a sanity check.

In [8]:
print("Training examples: %d" % census_train.shape[0])
print("Training labels: %d" % census_train_label.shape[0])
print()
print("Test examples: %d" % census_test.shape[0])
print("Test labels: %d" % census_test_label.shape[0])
Training examples: 32561
Training labels: 32561

Test examples: 16281
Test labels: 16281

Likewise, I like to see the head of each file, to help spot errors early on. First for the training examples...

In [9]:
census_train.head()
Out[9]:
age workclass fnlwgt education education-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba

... and now for the labels. Notice the label column is now true/false.

In [10]:
census_train_label.head(10)
Out[10]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8     True
9     True
Name: income, dtype: bool
In [ ]:
# Likewise, you could do a spot check of the testing examples and labels.
# census_test.head()
# census_test_label.head()

Estimators and Input Functions

TensorFlow Estimators provide a high-level API you can use to train your models. Here, we'll use Canned Estimators ("models-in-a-box"). These handle many implementation details for you, so you can focus on solving your problem (e.g., by coming up with informative features using the feature engineering techniques we introduce below).

To learn more about Estimators, you can watch this talk from Google I/O by Martin Wicke: Effective TensorFlow for Non-Experts. Here's a diagram of the methods we'll use here.

In [11]:
Image(filename='./images/estimators1.jpeg', width=400)
Out[11]:

You can probably guess the purpose of methods like train, evaluate, and predict. What may be new to you, though, are Input Functions. These are responsible for reading your data, preprocessing it, and sending it to the model. When you use an input function, your code will read estimator.train(your_input_function) rather than estimator.train(your_training_data).

First, we'll use a pre-built input function. This is useful for working with a Pandas dataset that you happen to already have in memory, as we do here. Next, we'll use the Datasets API to write our own. The Datasets API will become the standard way of writing input functions moving forward. It's in contrib in TensorFlow v1.3, but will most likely move to core in v1.4.

Input functions for training and testing data

Why do we need two input functions? There are a couple of differences in how we handle our training and testing data. We want the training input function to loop over the data indefinitely (returning batches of examples and labels when called). We want the testing input function to run for just one epoch, so we can make one prediction for each testing example. We'll also want to shuffle the training data, but not the testing data (so we can compare the predictions to the labels later).

In [12]:
def create_train_input_fn(): 
    return tf.estimator.inputs.pandas_input_fn(
        x=census_train,
        y=census_train_label, 
        batch_size=32,
        num_epochs=None, # Repeat forever
        shuffle=True)
In [13]:
def create_test_input_fn():
    return tf.estimator.inputs.pandas_input_fn(
        x=census_test,
        y=census_test_label, 
        num_epochs=1, # Just one epoch
        shuffle=False) # Don't shuffle so we can compare to census_test_label later

See the bottom of the notebook for an example of doing this with the new Datasets API.
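
If the difference between the two settings feels abstract, this plain-Python sketch may help. It is only an illustration of what `num_epochs` and `shuffle` mean, and it does not mirror how the Pandas input function is implemented internally. `make_batches` and its `seed` parameter are hypothetical names made up for this example.

```python
import itertools
import random


def make_batches(examples, batch_size, num_epochs, shuffle, seed=0):
    """Yield lists of examples; num_epochs=None repeats forever."""
    rng = random.Random(seed)
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        order = list(examples)
        if shuffle:
            rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]
        epoch += 1


# "Testing" style: one pass, original order, so outputs line up with the labels.
test_batches = list(make_batches(range(10), batch_size=4, num_epochs=1, shuffle=False))

# "Training" style: never runs out, so we take only as many batches as we need.
train_batches = list(itertools.islice(
    make_batches(range(10), batch_size=4, num_epochs=None, shuffle=True), 7))

print(test_batches)
print(len(train_batches))
```

The testing generator stops after three batches, while the training generator keeps producing batches for as long as the caller asks, which is why `estimator.train` needs a `steps` argument.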

Feature Engineering

Now we'll specify the features we'll use and how we'd like them represented. To do so, we'll use tf.feature_column. Basically, feature columns enable you to represent a column from the CSV file in a variety of interesting ways. Our goal here is to demonstrate how to work with different types of features, rather than to aim for an accurate model. Here are five different types we'll use in our Linear model:

  • A numeric_column. This is just a real-valued attribute.
  • A bucketized_column. TensorFlow automatically buckets a numeric column for us.
  • A categorical_column_with_vocabulary_list. This is just a categorical column, where you know the possible values in advance. This is useful when you have a small number of possibilities.
  • A categorical_column_with_hash_bucket. This is a useful way to represent categorical features when you have a large number of values. Beware of hash collisions.
  • A crossed_column. Linear models cannot consider interactions between features, so we'll ask TensorFlow to cross features for us.

In the Deep model, we'll also use:

  • An embedding column(!). This automatically creates an embedding for categorical data.

You can learn more about feature columns in the Large Scale Linear Models tutorial and the Wide & Deep tutorial, as well as in the API doc.

Following is a demo of a couple of the things you can do.

In [14]:
# A list of the feature columns we'll use to train the Linear model
feature_columns = []
In [15]:
# To start, we'll use the raw, numeric value of age.
age = tf.feature_column.numeric_column('age')
feature_columns.append(age)

Next, we'll add a bucketized column. Bucketing divides the data based on ranges, so the classifier can consider each independently. This is especially helpful to linear models. Here's what the buckets below look like for age, as seen using Facets.

In [16]:
Image(filename='./images/buckets.jpeg', width=400)
Out[16]:
In [17]:
age_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('age'), 
    boundaries=[31, 46, 60, 75, 90] # specify the ranges
)

feature_columns.append(age_buckets)
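
Here is a rough sketch of what bucketing does to a single age, using only the standard library. It assumes each boundary is the inclusive lower edge of a bucket, which is my reading of the feature column documentation, and `bucket_index` and `one_hot` are hypothetical helper names.

```python
import bisect

boundaries = [31, 46, 60, 75, 90]  # same boundaries as the cell above
num_buckets = len(boundaries) + 1  # five boundaries give six buckets


def bucket_index(age):
    """Ages below 31 land in bucket 0; each boundary starts a new bucket."""
    return bisect.bisect_right(boundaries, age)


def one_hot(index, size):
    """Represent a bucket as a vector with a single 1."""
    return [1 if i == index else 0 for i in range(size)]


for age in (25, 31, 45, 60, 95):
    print(age, bucket_index(age), one_hot(bucket_index(age), num_buckets))
```

Because each bucket becomes its own input, a linear model can learn a separate weight for, say, people in their thirties and people in their sixties, instead of a single slope for age.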

You can also generate evenly spaced boundaries, if you prefer not to list them by hand.

In [18]:
# age_buckets = tf.feature_column.bucketized_column(
#    tf.feature_column.numeric_column('age'),
#    boundaries=list(range(20, 91, 10))  # a boundary every ten years from 20 to 90
#)
In [19]:
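# Here's a categorical column
# We're specifying the possible values
# Note: as loaded above, string values keep the leading space from the CSV
# (e.g. ' Bachelors'), and the vocabulary lookup is an exact match. The deep
# model later in this notebook lists its vocabularies with that space.
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])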

feature_columns.append(education)
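
Conceptually, a vocabulary column is an exact-match lookup from a string to an index. The sketch below uses a shortened vocabulary and a hypothetical `lookup` helper, and it assumes unknown values map to -1 (which I believe is the default for this kind of column). It also shows why whitespace matters: elsewhere in this notebook the CSV values appear with a leading space, such as ' HS-grad'.

```python
vocabulary = ["Bachelors", "HS-grad", "11th", "Masters"]  # shortened for illustration
index_of = {term: i for i, term in enumerate(vocabulary)}


def lookup(value):
    """Return the vocabulary index, or -1 when the value is not listed."""
    return index_of.get(value, -1)


exact = lookup("HS-grad")             # listed exactly as written
padded = lookup(" HS-grad")           # leading space, so no match
stripped = lookup(" HS-grad".strip()) # matches again once the space is removed

print(exact, padded, stripped)
```

If every value in a column fails the lookup, the feature contributes nothing to the model, so it is worth checking a few raw values against your vocabulary.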

If you prefer not to specify the vocab in code, you can also read it from a file or, alternatively, use a categorical_column_with_hash_bucket. Beware of hash collisions.

In [20]:
# A categorical feature with a possibly large number of values
# and the vocabulary not specified in advance.
native_country = tf.feature_column.categorical_column_with_hash_bucket('native-country', 1000)
feature_columns.append(native_country)
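
To make "hash bucket" and "hash collision" concrete, here is a small sketch. It uses md5 from the standard library as a stand-in, so the bucket ids will not match the ones TensorFlow computes with its own hash function. `hash_bucket` is a hypothetical helper and the country strings are just illustrative values.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Map a string to an integer id in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


countries = [" United-States", " Cuba", " Mexico", " India", " Canada", " Japan"]

# Plenty of buckets: collisions are possible but unlikely.
ids_large = [hash_bucket(c, 1000) for c in countries]

# Fewer buckets than values: at least two countries must share an id.
ids_small = [hash_bucket(c, 4) for c in countries]

print(ids_large)
print(ids_small)
```

When two values share a bucket the model cannot tell them apart, which is the trade-off you accept for not having to list the vocabulary in advance.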

Now let's create a crossed column for age and education. Here's what this looks like.

In [21]:
Image(filename='./images/crossed.jpeg', width=400)
Out[21]:
In [22]:
age_cross_education = tf.feature_column.crossed_column(
    [age_buckets, education],
    hash_bucket_size=int(1e4) # Using a hash is handy here
)
feature_columns.append(age_cross_education)
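
One way to picture a crossed column: combine the two values into a single key, then hash that key into a bucket. This is a sketch of the idea only. The key format, the md5 hash, and the helper names `cross_key` and `crossed_id` are my own choices for illustration, not what TensorFlow does internally.

```python
import bisect
import hashlib

boundaries = [31, 46, 60, 75, 90]  # the age buckets defined earlier


def cross_key(age, education):
    """Combine the age bucket and the education value into one key."""
    age_bucket = bisect.bisect_right(boundaries, age)
    return "%d_X_%s" % (age_bucket, education)


def crossed_id(age, education, hash_bucket_size=10000):
    """Hash the combined key into one of hash_bucket_size ids."""
    digest = hashlib.md5(cross_key(age, education).encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


print(cross_key(35, "Bachelors"), crossed_id(35, "Bachelors"))
print(cross_key(40, "Bachelors"), crossed_id(40, "Bachelors"))
print(cross_key(50, "Bachelors"), crossed_id(50, "Bachelors"))
```

Ages 35 and 40 share an age bucket, so they produce the same crossed feature, while age 50 produces a different key. With 6 age buckets and 16 education values there are only 96 combinations, so 10,000 hash buckets leaves a lot of room.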

Train a Canned Linear Estimator

Note: logs and a checkpoint file will be written to model_dir. Delete this from disk before rerunning the notebook for a clean start.

In [23]:
train_input_fn = create_train_input_fn()
estimator = tf.estimator.LinearClassifier(feature_columns, model_dir='graphs/linear', n_classes=2)
estimator.train(train_input_fn, steps=1000)
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_tf_random_seed': 1, '_session_config': None, '_model_dir': 'graphs/linear', '_keep_checkpoint_max': 5, '_log_step_count_steps': 100, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': 600}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into graphs/linear\model.ckpt.
INFO:tensorflow:step = 1, loss = 22.1807
INFO:tensorflow:global_step/sec: 178.554
INFO:tensorflow:step = 101, loss = 14.0466 (0.571 sec)
INFO:tensorflow:global_step/sec: 190.821
INFO:tensorflow:step = 201, loss = 13.717 (0.521 sec)
INFO:tensorflow:global_step/sec: 152.424
INFO:tensorflow:step = 301, loss = 18.9618 (0.654 sec)
INFO:tensorflow:global_step/sec: 254.427
INFO:tensorflow:step = 401, loss = 18.1447 (0.398 sec)
INFO:tensorflow:global_step/sec: 232.535
INFO:tensorflow:step = 501, loss = 15.8699 (0.425 sec)
INFO:tensorflow:global_step/sec: 236.383
INFO:tensorflow:step = 601, loss = 10.9236 (0.426 sec)
INFO:tensorflow:global_step/sec: 246.889
INFO:tensorflow:step = 701, loss = 13.2436 (0.405 sec)
INFO:tensorflow:global_step/sec: 208.747
INFO:tensorflow:step = 801, loss = 9.60209 (0.477 sec)
INFO:tensorflow:global_step/sec: 185.855
INFO:tensorflow:step = 901, loss = 21.85 (0.541 sec)
INFO:tensorflow:Saving checkpoints for 1000 into graphs/linear\model.ckpt.
INFO:tensorflow:Loss for final step: 12.4908.
Out[23]:
<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x12f6d860>

Evaluate

In [24]:
test_input_fn = create_test_input_fn()
estimator.evaluate(test_input_fn)
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
INFO:tensorflow:Starting evaluation at 2017-10-31-22:07:27
INFO:tensorflow:Restoring parameters from graphs/linear\model.ckpt-1000
INFO:tensorflow:Finished evaluation at 2017-10-31-22:07:29
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.763774, accuracy_baseline = 0.763774, auc = 0.682711, auc_precision_recall = 0.358816, average_loss = 0.508887, global_step = 1000, label/mean = 0.236226, loss = 64.728, prediction/mean = 0.171827
Out[24]:
{'accuracy': 0.76377374,
 'accuracy_baseline': 0.76377374,
 'auc': 0.68271101,
 'auc_precision_recall': 0.35881615,
 'average_loss': 0.50888681,
 'global_step': 1000,
 'label/mean': 0.23622628,
 'loss': 64.72802,
 'prediction/mean': 0.17182657}

Predict

The Estimator's predict method returns a generator object. This bit of code demonstrates how to retrieve predictions for individual examples.

In [25]:
# reinitialize the input function
test_input_fn = create_test_input_fn()

predictions = estimator.predict(test_input_fn)
for i, prediction in enumerate(predictions):
    if i == 5:
        break
    # .iloc indexes by position, which matches the unshuffled prediction order
    true_label = census_test_label.iloc[i]
    predicted_label = prediction['class_ids'][0]
    # Uncomment the following line to see probabilities for individual classes
    # print(prediction)
    print("Example %d. Actual: %d, Predicted: %d" % (i, true_label, predicted_label))
INFO:tensorflow:Restoring parameters from graphs/linear\model.ckpt-1000
Example 0. Actual: 0, Predicted: 0
Example 1. Actual: 0, Predicted: 0
Example 2. Actual: 1, Predicted: 0
Example 3. Actual: 1, Predicted: 0
Example 4. Actual: 0, Predicted: 0

What features can you use to achieve higher accuracy?

This dataset is imbalanced, so an accuracy of around 76% is low in this context (this could be achieved merely by predicting everyone makes less than 50k / year). Notice that accuracy and accuracy_baseline are identical in the evaluation output above. In fact, if you look through the predictions closely, you'll find that many are zero. We'll get a little smarter as we go.
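
As a quick check on that claim, the baseline can be recovered from the label/mean value in the evaluation output. This sketch assumes the baseline simply means "always predict the more common class".

```python
label_mean = 0.23622628  # 'label/mean' from the evaluation output: fraction of >50K labels

# Always predicting the more common class is right this often.
majority_baseline = max(label_mean, 1.0 - label_mean)

print(round(majority_baseline, 6))  # 0.763774
```

That matches the accuracy_baseline reported above, so a model needs to beat roughly 76% before its accuracy tells us anything.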

Train a Deep Model

Add an embedding feature(!) and update the feature columns

Instead of using a hash to represent categorical features, here we'll use a learned embedding. (Cool, right?) We'll also update how the features are represented for our deep model. Here, we'll use a different combination of features than before, just for fun.

In [26]:
# We'll provide vocabulary lists for features with just a few terms
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass',
    [' Self-emp-not-inc', ' Private', ' State-gov', ' Federal-gov',
     ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay', ' Never-worked'])

education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education',
    [' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th', ' Some-college',
     ' Assoc-acdm', ' Assoc-voc', ' 7th-8th', ' Doctorate', ' Prof-school',
     ' 5th-6th', ' 10th', ' 1st-4th', ' Preschool', ' 12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital-status',
    [' Married-civ-spouse', ' Divorced', ' Married-spouse-absent',
     ' Never-married', ' Separated', ' Married-AF-spouse', ' Widowed'])
     
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    'relationship',
    [' Husband', ' Not-in-family', ' Wife', ' Own-child', ' Unmarried',
     ' Other-relative'])
In [27]:
feature_columns = [

    # Use indicator columns for low dimensional vocabularies
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(marital_status),
    tf.feature_column.indicator_column(relationship),

    # Use embedding columns for high dimensional vocabularies
    tf.feature_column.embedding_column(  # now using embedding!
        # params are hash buckets, embedding size
        tf.feature_column.categorical_column_with_hash_bucket('occupation', 100), 10),
    
    # numeric features
    tf.feature_column.numeric_column('age'),
    tf.feature_column.numeric_column('education-num'),
    tf.feature_column.numeric_column('capital-gain'),
    tf.feature_column.numeric_column('capital-loss'),
    tf.feature_column.numeric_column('hours-per-week'),   
]
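
To give a feel for what the embedding column above produces, here is a toy lookup in plain Python. The table is filled with random numbers purely for illustration (in the real model the values are learned during training), and `embed` is a hypothetical helper. The sizes match the cell above: 100 hash buckets and an embedding size of 10.

```python
import random

num_buckets, embedding_size = 100, 10
rng = random.Random(0)

# One row of 10 numbers per hash bucket. Training would adjust these values.
embedding_table = [
    [rng.uniform(-0.1, 0.1) for _ in range(embedding_size)]
    for _ in range(num_buckets)
]


def embed(bucket_id):
    """Look up the dense vector for a hashed occupation."""
    return embedding_table[bucket_id]


vector = embed(42)
print(len(vector), vector[:3])
```

Compare this with an indicator column over the same 100 buckets, which would hand the network a 100-element vector containing a single 1. The embedding gives it 10 dense numbers instead.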
In [28]:
estimator = tf.estimator.DNNClassifier(hidden_units=[256, 128, 64], 
                                       feature_columns=feature_columns, 
                                       n_classes=2, 
                                       model_dir='graphs/dnn')
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_tf_random_seed': 1, '_session_config': None, '_model_dir': 'graphs/dnn', '_keep_checkpoint_max': 5, '_log_step_count_steps': 100, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': 600}
In [29]:
train_input_fn = create_train_input_fn()
estimator.train(train_input_fn, steps=2000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into graphs/dnn\model.ckpt.
INFO:tensorflow:step = 1, loss = 261.936
INFO:tensorflow:global_step/sec: 207.448
INFO:tensorflow:step = 101, loss = 14.3686 (0.485 sec)
INFO:tensorflow:global_step/sec: 217.843
INFO:tensorflow:step = 201, loss = 13.3399 (0.459 sec)
INFO:tensorflow:global_step/sec: 236.943
INFO:tensorflow:step = 301, loss = 15.0694 (0.423 sec)
INFO:tensorflow:global_step/sec: 212.745
INFO:tensorflow:step = 401, loss = 9.14383 (0.470 sec)
INFO:tensorflow:global_step/sec: 238.071
INFO:tensorflow:step = 501, loss = 8.34114 (0.419 sec)
INFO:tensorflow:global_step/sec: 239.784
INFO:tensorflow:step = 601, loss = 13.7219 (0.418 sec)
INFO:tensorflow:global_step/sec: 246.889
INFO:tensorflow:step = 701, loss = 13.0856 (0.406 sec)
INFO:tensorflow:global_step/sec: 229.862
INFO:tensorflow:step = 801, loss = 11.7179 (0.436 sec)
INFO:tensorflow:global_step/sec: 218.319
INFO:tensorflow:step = 901, loss = 10.3873 (0.453 sec)
INFO:tensorflow:global_step/sec: 219.276
INFO:tensorflow:step = 1001, loss = 14.8484 (0.461 sec)
INFO:tensorflow:global_step/sec: 224.193
INFO:tensorflow:step = 1101, loss = 12.6935 (0.448 sec)
INFO:tensorflow:global_step/sec: 237.506
INFO:tensorflow:step = 1201, loss = 12.8023 (0.418 sec)
INFO:tensorflow:global_step/sec: 228.288
INFO:tensorflow:step = 1301, loss = 11.639 (0.438 sec)
INFO:tensorflow:global_step/sec: 218.319
INFO:tensorflow:step = 1401, loss = 11.8508 (0.454 sec)
INFO:tensorflow:global_step/sec: 214.571
INFO:tensorflow:step = 1501, loss = 11.8766 (0.466 sec)
INFO:tensorflow:global_step/sec: 217.843
INFO:tensorflow:step = 1601, loss = 6.8325 (0.460 sec)
INFO:tensorflow:global_step/sec: 223.691
INFO:tensorflow:step = 1701, loss = 15.5349 (0.447 sec)
INFO:tensorflow:global_step/sec: 213.654
INFO:tensorflow:step = 1801, loss = 8.45981 (0.468 sec)
INFO:tensorflow:global_step/sec: 215.961
INFO:tensorflow:step = 1901, loss = 9.2356 (0.464 sec)
INFO:tensorflow:Saving checkpoints for 2000 into graphs/dnn\model.ckpt.
INFO:tensorflow:Loss for final step: 16.7075.
Out[29]:
<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x13b317b8>
In [30]:
test_input_fn = create_test_input_fn()
estimator.evaluate(test_input_fn)
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
INFO:tensorflow:Starting evaluation at 2017-10-31-22:15:30
INFO:tensorflow:Restoring parameters from graphs/dnn\model.ckpt-2000
INFO:tensorflow:Finished evaluation at 2017-10-31-22:15:32
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.798784, accuracy_baseline = 0.763774, auc = 0.869684, auc_precision_recall = 0.628534, average_loss = 0.366575, global_step = 2000, label/mean = 0.236226, loss = 46.6267, prediction/mean = 0.255373
Out[30]:
{'accuracy': 0.79878384,
 'accuracy_baseline': 0.76377374,
 'auc': 0.86968398,
 'auc_precision_recall': 0.62853426,
 'average_loss': 0.36657533,
 'global_step': 2000,
 'label/mean': 0.23622628,
 'loss': 46.626663,
 'prediction/mean': 0.25537312}

That's a little better.

TensorBoard

If you like, you can start TensorBoard by running this command from a terminal (in the same directory as this notebook):

$ tensorboard --logdir=graphs

then pointing your web-browser to http://localhost:6006 (check the TensorBoard output in the terminal in case it's running on a different port).

When that launches, you'll be able to see a variety of graphs that compare the linear and deep models.

In [31]:
Image(filename='./images/tensorboard.jpeg', width=500)
Out[31]:

Datasets API

Here, I'll demonstrate how to use the new Datasets API, which you can use to write complex input pipelines from simple, reusable pieces.

At the time of writing (v1.3) this API is in contrib. It's most likely moving into core in v1.4, which is good news. Using TensorFlow 1.4, the code below can be written using regular Python code to parse the CSV file, via the Dataset.from_generator() method. This improves productivity a lot - it means you can use Python to read, parse, and apply whatever logic you wish to your input data - then you can take advantage of the reusable pieces of the Datasets API (e.g., batch, shuffle, repeat, etc) - as well as the optional performance tuning (e.g., prefetch, parallel process, etc).

In combination with Estimators, this means you can train and tune deep models at scale on data of almost any size, entirely using a high-level API. I'll update this notebook after v1.4 is released with an example. It's neat.
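
In the meantime, here is the kind of plain-Python generator I have in mind. It does not call TensorFlow at all, so treat it as a sketch of the parsing logic you might hand to such a method rather than as the final API. `census_rows` is a hypothetical name, and the sample line is the first training row shown earlier with its label appended.

```python
import csv
import io

column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'gender',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income'
]


def census_rows(lines):
    """Yield (features, label) pairs, skipping blank or malformed lines."""
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != len(column_names):
            continue
        features = dict(zip(column_names, row))
        label = ">50K" in features.pop('income')
        yield features, label


sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "\n"
)
rows = list(census_rows(sample))
print(rows[0][0]['education'], rows[0][1])
```

All values are left as strings here. Converting the numeric columns, or any other logic you like, is just more Python inside the loop.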

In [32]:
# I'm going to reset the notebook to show you how to do this from a clean slate
%reset -f 

import collections
import tensorflow as tf

census_train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
census_test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
census_train_path = tf.contrib.keras.utils.get_file('census.train', census_train_url)
census_test_path = tf.contrib.keras.utils.get_file('census.test', census_test_url)
In [33]:
# Provide default values for each of the CSV columns
# and a header at the same time.
csv_defaults = collections.OrderedDict([
  ('age',[0]),
  ('workclass',['']),
  ('fnlwgt',[0]),
  ('education',['']),
  ('education-num',[0]),
  ('marital-status',['']),
  ('occupation',['']),
  ('relationship',['']),
  ('race',['']),
  ('sex',['']),
  ('capital-gain',[0]),
  ('capital-loss',[0]),
  ('hours-per-week',[0]),
  ('native-country',['']),
  ('income',['']),
])
In [34]:
# Decode a line from the CSV.
def csv_decoder(line):
    """Convert a CSV row to a dictionary of features."""
    parsed = tf.decode_csv(line, list(csv_defaults.values()))
    return dict(zip(csv_defaults.keys(), parsed))

# The train file has an extra empty line at the end.
# We'll use this method to filter that out.
def filter_empty_lines(line):
    return tf.not_equal(tf.size(tf.string_split([line], ',').values), 0)

def create_train_input_fn(path):
    def input_fn():    
        dataset = (
            tf.contrib.data.TextLineDataset(path)  # create a dataset from a file
                .filter(filter_empty_lines)  # ignore empty lines
                .map(csv_decoder)  # parse each row
                .shuffle(buffer_size=1000)  # shuffle the dataset
                .repeat()  # repeat indefinitely
                .batch(32)) # batch the data

        # create iterator
        columns = dataset.make_one_shot_iterator().get_next()
        
        # separate the label and convert it to true/false
        income = tf.equal(columns.pop('income')," >50K") 
        return columns, income
    return input_fn

def create_test_input_fn(path):
    def input_fn():    
        dataset = (
            tf.contrib.data.TextLineDataset(path)
                .skip(1) # The test file has a strange first line, we want to ignore this.
                .filter(filter_empty_lines)
                .map(csv_decoder)
                .batch(32))

        # create iterator
        columns = dataset.make_one_shot_iterator().get_next()
        
        # separate the label and convert it to true/false
        # (labels in the test file end with a period, as noted earlier)
        income = tf.equal(columns.pop('income'), " >50K.")
        return columns, income
    return input_fn
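
One detail worth understanding in the training pipeline is `shuffle(buffer_size=1000)`. As I understand it, the dataset keeps a fixed-size buffer and emits a random element from it each time, so the shuffle is local rather than global. The sketch below imitates that behavior in plain Python. `buffered_shuffle` and its `seed` parameter are hypothetical, and this is an approximation of the idea rather than the library's implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element...
        buffer[idx] = item       # ...and put the new item in its place
    rng.shuffle(buffer)          # input exhausted: drain what is left
    for item in buffer:
        yield item


shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

Every element still appears exactly once, but an element can only move forward by at most the buffer size. With 32,561 training rows and a buffer of 1,000, rows from the end of the file never show up in the first batches, so consider a larger buffer if your file is sorted in some meaningful way.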

Here's code you can use to test the Dataset input functions

In [35]:
train_input_fn = create_train_input_fn(census_train_path)
next_batch = train_input_fn()

with tf.Session() as sess:
    features, label = sess.run(next_batch)
    print(features['education'])
    print(label)

    print()

    features, label = sess.run(next_batch)
    print(features['education'])
    print(label)
[b' HS-grad' b' 11th' b' Bachelors' b' 12th' b' Bachelors' b' HS-grad'
 b' Some-college' b' Some-college' b' Bachelors' b' 11th' b' HS-grad'
 b' HS-grad' b' 11th' b' Some-college' b' 7th-8th' b' Some-college'
 b' HS-grad' b' HS-grad' b' Bachelors' b' Bachelors' b' Bachelors'
 b' Assoc-acdm' b' Some-college' b' Bachelors' b' Bachelors' b' Assoc-acdm'
 b' Masters' b' HS-grad' b' HS-grad' b' 1st-4th' b' Bachelors' b' HS-grad']
[False False False False False False False False  True False False  True
 False False False False False False  True  True False  True False False
 False  True  True False False False  True False]

[b' Bachelors' b' Some-college' b' HS-grad' b' HS-grad' b' Bachelors'
 b' HS-grad' b' Masters' b' Bachelors' b' Some-college' b' HS-grad'
 b' Some-college' b' HS-grad' b' Some-college' b' Some-college' b' HS-grad'
 b' HS-grad' b' Some-college' b' HS-grad' b' Bachelors' b' Some-college'
 b' Bachelors' b' Some-college' b' Bachelors' b' HS-grad' b' HS-grad'
 b' Some-college' b' Some-college' b' Masters' b' HS-grad' b' HS-grad'
 b' Bachelors' b' HS-grad']
[False False False False False False False  True False False  True False
 False False False False False False  True False False  True  True False
 False False False  True False False  True False]

From here, you can use the input functions to train and evaluate your Estimators. I'll add some minimal code to do this, just to show the mechanics.

In [36]:
train_input_fn = create_train_input_fn(census_train_path)
test_input_fn = create_test_input_fn(census_test_path)

feature_columns = [
    tf.feature_column.numeric_column('age'),
]

estimator = tf.estimator.DNNClassifier(hidden_units=[256, 128, 64], 
                                       feature_columns=feature_columns, 
                                       n_classes=2, 
                                       # creating a new folder in case you haven't cleared 
                                       # the old one yet
                                       model_dir='graphs_datasets/dnn')

estimator.train(train_input_fn, steps=100)
# Evaluates on 100 batches drawn from the training input function
estimator.evaluate(train_input_fn, steps=100)
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_tf_random_seed': 1, '_session_config': None, '_model_dir': 'graphs_datasets/dnn', '_keep_checkpoint_max': 5, '_log_step_count_steps': 100, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': 600}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into graphs_datasets/dnn\model.ckpt.
INFO:tensorflow:step = 1, loss = 60.496
INFO:tensorflow:Saving checkpoints for 100 into graphs_datasets/dnn\model.ckpt.
INFO:tensorflow:Loss for final step: 19.6795.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
WARNING:tensorflow:Casting <dtype: 'float32'> labels to bool.
INFO:tensorflow:Starting evaluation at 2017-10-31-22:15:49
INFO:tensorflow:Restoring parameters from graphs_datasets/dnn\model.ckpt-100
INFO:tensorflow:Evaluation [1/100]
INFO:tensorflow:Evaluation [2/100]
INFO:tensorflow:Evaluation [3/100]
INFO:tensorflow:Evaluation [4/100]
INFO:tensorflow:Evaluation [5/100]
INFO:tensorflow:Evaluation [6/100]
INFO:tensorflow:Evaluation [7/100]
INFO:tensorflow:Evaluation [8/100]
INFO:tensorflow:Evaluation [9/100]
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [11/100]
INFO:tensorflow:Evaluation [12/100]
INFO:tensorflow:Evaluation [13/100]
INFO:tensorflow:Evaluation [14/100]
INFO:tensorflow:Evaluation [15/100]
INFO:tensorflow:Evaluation [16/100]
INFO:tensorflow:Evaluation [17/100]
INFO:tensorflow:Evaluation [18/100]
INFO:tensorflow:Evaluation [19/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [21/100]
INFO:tensorflow:Evaluation [22/100]
INFO:tensorflow:Evaluation [23/100]
INFO:tensorflow:Evaluation [24/100]
INFO:tensorflow:Evaluation [25/100]
INFO:tensorflow:Evaluation [26/100]
INFO:tensorflow:Evaluation [27/100]
INFO:tensorflow:Evaluation [28/100]
INFO:tensorflow:Evaluation [29/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Evaluation [31/100]
INFO:tensorflow:Evaluation [32/100]
INFO:tensorflow:Evaluation [33/100]
INFO:tensorflow:Evaluation [34/100]
INFO:tensorflow:Evaluation [35/100]
INFO:tensorflow:Evaluation [36/100]
INFO:tensorflow:Evaluation [37/100]
INFO:tensorflow:Evaluation [38/100]
INFO:tensorflow:Evaluation [39/100]
INFO:tensorflow:Evaluation [40/100]
INFO:tensorflow:Evaluation [41/100]
INFO:tensorflow:Evaluation [42/100]
INFO:tensorflow:Evaluation [43/100]
INFO:tensorflow:Evaluation [44/100]
INFO:tensorflow:Evaluation [45/100]
INFO:tensorflow:Evaluation [46/100]
INFO:tensorflow:Evaluation [47/100]
INFO:tensorflow:Evaluation [48/100]
INFO:tensorflow:Evaluation [49/100]
INFO:tensorflow:Evaluation [50/100]
INFO:tensorflow:Evaluation [51/100]
INFO:tensorflow:Evaluation [52/100]
INFO:tensorflow:Evaluation [53/100]
INFO:tensorflow:Evaluation [54/100]
INFO:tensorflow:Evaluation [55/100]
INFO:tensorflow:Evaluation [56/100]
INFO:tensorflow:Evaluation [57/100]
INFO:tensorflow:Evaluation [58/100]
INFO:tensorflow:Evaluation [59/100]
INFO:tensorflow:Evaluation [60/100]
INFO:tensorflow:Evaluation [61/100]
INFO:tensorflow:Evaluation [62/100]
INFO:tensorflow:Evaluation [63/100]
INFO:tensorflow:Evaluation [64/100]
INFO:tensorflow:Evaluation [65/100]
INFO:tensorflow:Evaluation [66/100]
INFO:tensorflow:Evaluation [67/100]
INFO:tensorflow:Evaluation [68/100]
INFO:tensorflow:Evaluation [69/100]
INFO:tensorflow:Evaluation [70/100]
INFO:tensorflow:Evaluation [71/100]
INFO:tensorflow:Evaluation [72/100]
INFO:tensorflow:Evaluation [73/100]
INFO:tensorflow:Evaluation [74/100]
INFO:tensorflow:Evaluation [75/100]
INFO:tensorflow:Evaluation [76/100]
INFO:tensorflow:Evaluation [77/100]
INFO:tensorflow:Evaluation [78/100]
INFO:tensorflow:Evaluation [79/100]
INFO:tensorflow:Evaluation [80/100]
INFO:tensorflow:Evaluation [81/100]
INFO:tensorflow:Evaluation [82/100]
INFO:tensorflow:Evaluation [83/100]
INFO:tensorflow:Evaluation [84/100]
INFO:tensorflow:Evaluation [85/100]
INFO:tensorflow:Evaluation [86/100]
INFO:tensorflow:Evaluation [87/100]
INFO:tensorflow:Evaluation [88/100]
INFO:tensorflow:Evaluation [89/100]
INFO:tensorflow:Evaluation [90/100]
INFO:tensorflow:Evaluation [91/100]
INFO:tensorflow:Evaluation [92/100]
INFO:tensorflow:Evaluation [93/100]
INFO:tensorflow:Evaluation [94/100]
INFO:tensorflow:Evaluation [95/100]
INFO:tensorflow:Evaluation [96/100]
INFO:tensorflow:Evaluation [97/100]
INFO:tensorflow:Evaluation [98/100]
INFO:tensorflow:Evaluation [99/100]
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2017-10-31-22:15:51
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.75875, accuracy_baseline = 0.75875, auc = 0.325595, auc_precision_recall = 0.169532, average_loss = 0.566001, global_step = 100, label/mean = 0.24125, loss = 18.112, prediction/mean = 0.277364
Out[36]:
{'accuracy': 0.75875002,
 'accuracy_baseline': 0.75875002,
 'auc': 0.32559526,
 'auc_precision_recall': 0.16953155,
 'average_loss': 0.56600058,
 'global_step': 100,
 'label/mean': 0.24124999,
 'loss': 18.112019,
 'prediction/mean': 0.27736449}

This would be a good time to clean up the logs and checkpoints on disk, by deleting ./graphs and ./graphs_datasets.

Next steps

To learn more about feature engineering

Check out the Wide and Deep tutorial. Also, see that tutorial for another kind of Estimator you can try that combines the Linear and Deep models.

To learn more about Datasets

Check out the Programmer's Guide, and check back after v1.4 is released for the Dataset.from_generator method, which I think will improve productivity a lot.
